Languguage OS 2 / Languguage OS II Version 10-94 (Knowledge Media)(1994).ISO / language / condor / condor-w.arc
Text File | 1994-10-18 | 27KB | 621 lines
Subject: Condor mailing list
From: miron@chevre.cs.wisc.edu (Miron Livny)
Date: Tue, 29 Dec 92 19:39:23 -0600
We are proud to announce the birth of yet another mailing list -
the condor_world mailing list (condor-world@cs.wisc.edu).
As all of us know, mailing lists are a mixed blessing. We were
playing with the idea of starting such a list for more than a year.
So far we have been able to control our desire to take advantage
of the unutilized capacity of the Internet to send Condor
related mail. However, the interest in Condor and the number
of active Condor pools have reached a point where we could not
say no any more. The bad part of the news is that not only did
we establish such a list, we included your name in the list.
You are on the list because at one time or another you expressed
interest in Condor, and (maybe) asked to be on the Condor
mailing list. There is also a good part to the story: you can
get off the list. Just send a note to "owner-condor-world@cs.wisc.edu".
Condor_world should serve as a means to exchange information and
experience relating to Condor. We would also like to use it
as a channel for comments, wishes and complaints regarding the
system. As we continue to work on the problem of batch processing
in a cluster of workstations, we would like to know what *you* think
about the system. Maybe the best way to get started would be for
each of you to tell us why and how you got involved with batch
processing on a cluster of workstations.
Mike Litzkow and Miron Livny.
=================================================================
Subject: Apology
From: condor (Mike Litzkow)
Date: Tue, 05 Jan 93 10:58:06 -0600
Dear Colleagues,
First I want to apologize to all of you who have been receiving
multiple copies of items sent to the "condor-world" mailing list.
We are working on the problem, and hopefully you will only get
one copy of this message.
Also I would like to point out that "condor-world" is an unmoderated
mailing list. Every message sent to "condor-world" is rebroadcast
to the whole list. This makes it inappropriate to send requests
to get on or off the list or notes about technical problems like
receiving multiple copies to "condor-world". Please use
"condor-world-request" for any messages you do not want to broadcast
to the whole group.
best regards,
-- mike
=================================================================
Subject: Condor on HP Snakes
From: condor (Mike Litzkow)
Date: Fri, 22 Jan 93 13:57:45 -0600
Dear Colleagues,
An alpha test version of Condor for HP 700 series machines (Snakes), is
now available on our ftp server "ftp.cs.wisc.edu". This code is
running on a few machines in our local environment, but is largely
untested.
Bugs
We do know of one "bug" already. HP executables submitted to
the condor pool from other platforms will not work because of
incompatibilities between the system call sets defined by HP-UX
and other UNIX variants. HP executables submitted from HP's
should work.
Hints for Building
For building condor, both the "imake" and the "cpp" which came
with your system should be fine. We don't recommend using the
versions supplied in the "imake_tools" directory. The shell
script "mdepend.sh" in the "GENERIC" directory will be
needed. Don't use the version of "makedepend" that might
have come with your system.
best regards,
-- mike
P.S. I will be at the USENIX conference in San Diego most of next week.
I will not be answering mail during that time, but if any of you plan
to be there and would like to look me up in person, please do so. I
am staying at the conference hotel.
=================================================================
Subject: Condor on Silicon Graphics Workstations
From: condor (Mike Litzkow)
Date: Mon, 01 Mar 93 11:54:00 -0600
Friends,
An "alpha" release of Condor for the Silicon Graphics workstations running
IRIX 4.X is now available. This has been tested at two sites on IRIX 4.0.5.F
systems, but will probably run on other IRIX 4 systems as well. This
release is called "Condor_4.1.irix.alpha" and is available by anonymous
ftp from "ftp.cs.wisc.edu" as usual. We are interested in feedback on
your experiences with building, installing, and using this system. If you
have problems, let me know and I will try to help - but please be patient.
We do not have an SGI machine here which we can place in our Condor pool
for extensive testing, so we are completely dependent on the good will
of others for this work. A few notes which should help with the build
process follow.
cheers,
-- mike
Notes:
1. Please make a "condor" user and place a "CONDOR" directory
in condor's home directory on your build machine. Extract
the tar file there.
2. You will need to use "imake" in the build process. You
should use the imake already on your system for this; don't
build it from my source. I think you will find this in
"/usr/bin/X11/imake". Imake will use "cpp" to do part of its
work. The particular version of cpp used can be altered by
setting an environment variable called "IMAKECPP". The SGI
supplied "cpp" is fine, so don't set this environment
variable. The installation instructions tell you to set up an
alias for "imake". Make sure you do that.
3. You will need to use a "make depend" program in the build
process. Don't use the "makedepend" supplied in the X
distribution. Use the shell script in the CONDOR/GENERIC
directory. You can do this by setting
#define MkDepend $(TOP)/GENERIC/mdepend.sh in your
"config/SGI_IRIX405.cf" file.
4. On SGI systems (and possibly others) you can use an environment
variable (SHELL) to control which shell will be used by "make".
The Condor Makefiles expect this to be the bourne shell "/bin/sh".
Either "unsetenv" this variable or set it to /bin/sh during
your Condor building.
==============================================================
Subject: Condor on Silicon Graphics
From: condor (Mike Litzkow)
Date: Wed, 26 May 93 09:31:27 -0600
Friends,
Our alpha test of Condor on Silicon Graphics 4.0.5 machines has turned up
a few problems. It seems that due to differences in the compiler technology,
the checkpointing mechanism works only on some versions of these
machines. Feedback so far indicates that the alpha code is working
on the IP7 and IP20 systems, but not on the IP12 and IP17s. You can
determine the type of system you have by running "uname -a". I am sorry,
but I don't know the mapping between the common names like "Indigo" and
"Crimson" and the "IP" designations.
The alpha code is still available from "ftp.cs.wisc.edu" for those of
you who can use it or would like to play with it. It is unlikely that
we will be able to produce an improved version soon.
A few folks have asked about running older versions of condor which had
some code for IRIX 3.3.1, but that was never official and I believe
converting it to work with the IRIX 4* systems would be a very big task.
regards,
-- mike
==============================================================
Subject: Sun Compatibility Problems
From: condor (Mike Litzkow)
Date: Fri, 04 Jun 93 16:20:42 -0600
Friends,
We have recently discovered some incompatibilities between executables
built on sparc 10's and other sparc based Suns. You can determine the
specific types of your Suns by running "uname -m". The sparc 10's will
say "sun4m" while the others will say either "sun4" or "sun4c". If all
of your Suns are of the same category, then the problems described here
won't affect you.
Problem 1:
Condor executables built on Sparc 10s cannot run and checkpoint
properly on other sparcs. Similarly, executables built on other Suns
will not run and checkpoint properly on Sparc 10 systems. The cause is
a difference in where the user stack is placed in memory on the two
kinds of machines. The usual symptom is that the user process dies
with a segmentation fault (signal 11).
If you need to run both sparc 10's and other sparcs in the same Condor
pool, you will need to arrange for Condor to view them as two different
machine architectures. This can be done by changing the "ARCH" macro
in your "condor_config" or "condor_config.local" files. I would
suggest setting ARCH to "sun4m" on sparc 10's and "sun4" on the
others. Also the condor libraries will need to be compiled separately
for the two varieties of machines and distributed in a way which will
make the appropriate libraries visible on the proper machines.
I believe it is possible to submit jobs to run on "sun4m" machines
from "sun4" machines or the reverse. The critical point is that the
jobs are linked with a condor library which has been built on the
same type of machine where you want them to run.
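As a sketch of the config change described above (the ARCH macro name is from the text; the exact file layout may differ by Condor release), the local configuration on each machine might read:

```
##  condor_config.local on a Sparc 10
ARCH = sun4m

##  condor_config.local on a sun4 or sun4c machine
ARCH = sun4
```

With distinct ARCH values, the matchmaker will never pair a job linked on one variety with a machine of the other.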
Problem 2:
Some of the Condor daemons cannot share executable files between the
two kinds of machines. This is due to a difference in the
implementations of the "kvm" library on the two platforms, and affects
not only Condor, but other system programs like "top" and "ps". We
know that the "condor_startd" is affected, but other Condor daemons may
be involved as well. I recommend compiling and distributing completely
separate sets of Condor executables for the two machine types.
The good news is that the incompatibilities exist only at the object
code level. No source code changes are needed.
best regards,
-- mike
==============================================================
Subject: Condor and Unsupported System Calls
From: condor (Mike Litzkow)
Date: Sun, 20 Jun 93 14:42:17 -0600
Friends,
There have been a couple of postings to the group lately regarding programs
that make system calls which aren't supported by Condor. There are a fair
number of such system calls - pipe, fork, exec, socket, and ioctl to name
a few. Quite often the programmer is unaware of using the taboo calls
because they are actually made by some library routine, and the
problem only surfaces when the Condor job fails to run.
In most cases when a user program attempts such a system call, Condor
will place an error message in the program's standard error file, and
then should terminate the job (more on this later). In some cases
though, the system call will simply return an error status, and the
user program is free to react as it wants. We do it this way because
some calls are exercised very frequently by Fortran run time libraries,
and could never complete if Condor terminated them. An example of
such a call is sigvec(). Most Fortran programs will run just fine
under Condor even though at the beginning they do about 20 sigvec()
calls which all fail! In the case of other calls like fork and exec, we feel
it is best to let the user know right away that this program cannot
be run with Condor.
It turns out that Condor_4.1.3b (and possibly other versions as well),
has a bug in how the "job terminating" illegal system calls are handled.
The bug is in the "condor_shadow" program, and a fix is attached. The
bug leads to the program being tried over and over by Condor, even
though it can never run. In many cases the erroneous system call
occurs at the very beginning of the program, which means it will execute
only a short time before encountering the problem. The user runs
"condor_q" many times, but never sees the job in the "Run" state, and
so concludes that Condor is refusing to run the job for some unknown
reason. I recommend that everyone apply the fix because of the large
potential for confusion.
regards,
-- mike
=================================================================
The bug is in the "condor_shadow.c" file around line 466. The sequence
    if( read(pipe,&msg_type,sizeof msg_type) != sizeof msg_type ) {
        EXCEPT( "Could not read msg_type from child shadow" );
    }
should be
    if( read(pipe,&msg_type,sizeof msg_type) != sizeof msg_type ) {
        /* the child probably has died */
        break;
    }
==============================================================
Subject: Bug Fix
From: condor (Mike Litzkow)
Date: Tue, 06 Jul 93 10:27:52 -0600
Friends,
We have discovered a bug in the "condor_schedd". This bug only affects
installations where the condor "log" directory is remotely mounted
via NFS. (All diskless machines will be in this category, but machines
with disks could be set up this way too.)
Description:
In such an installation the Condor daemons need to be able to write
to their respective log files, and should do so by running with an
effective uid of "condor" during all logging operations. In fact all of
the daemons are intended to run with an effective uid of "condor" at
all times except in those few instances when local "root" permission
is required. This is because at most NFS installations, remote accesses
to files by "root" will be handled as if they were attempted by the
unprivileged user "nobody". The daemons do need to run with their
euid's set temporarily to "root" at certain times, for example when sending
signals to local processes. When necessary, the euid is switched between
"root" and "condor" with a pair of routines called "set_condor_euid" and
"set_root_euid". The bug in the "condor_schedd" causes it to run
with an euid of "root" at all times, and it is therefore not able to
access remotely mounted log files.
Fix:
The routines "create_job_queue" and "mark_jobs_idle" both have calls to
"set_condor_euid" near the top and "set_root_euid" near the bottom.
All 4 of these calls should be eliminated.
best regards,
-- mike
=============================================================
Subject: Sun Checkpointing Problem
From: Mike Litzkow <condor@goya.cs.wisc.edu>
Date: Thu, 22 Jul 93 11:31:10 -0500
Friends,
Description:
Some folks have been having a problem with Condor's being unable to
produce a "core" file on sun4m (Sparc-10) systems. This prevents
the job from checkpointing, and has the symptom that the job's
"image size" inexplicably jumps to some unreasonably large value.
Once the image size grows very large, Condor will not be able to
find any machines where the job can run, so it will sit in the
queue forever. So far, the problem has been reported on Sparc-10
systems only, but we are not sure whether it may affect other
Sun systems as well.
Detailed Discussion:
The problem is related to Sun's use of "holes" in their core files. These
are large areas in the file which appear as all zeros when the file is
read, but actually take up no space on the disk. This means that the core
file's virtual size is different from its actual size. In particular the
virtual size of the stack segment that is dumped is the same as the stack
size *limit* in force at the time of the core dump. The actual size of the
dumped stack segment depends on the actual number of pages mapped into the
stack segment at dump time.
The condor starter wants to allow its user processes to use as many
resources as they desire, so it sets all the limits (including the
STACK) to "infinity". When it comes time to dump the core, the kernel
must decide whether there is sufficient free disk, but apparently it
(incorrectly) uses the virtual rather than the actual size of the stack
for this calculation. The result is no core file ever appears. The
condor_starter then concludes that the core file didn't appear because
there was insufficient disk, and adjusts the process's image size
up to the amount of free disk which exists at the time.
Workaround:
The ultimate solution to this problem can only come from Sun, but
you can get things working in most cases by setting a more modest
limit on the user process's stack size. For most processes 8 megabytes
is a reasonable value to use. If you choose this figure, then
any user process which needs more than 8 meg of stack will crash. Also,
checkpointing will still require something over 8 meg of free disk
and will fail on machines with less than that. To set the limit,
change the code in the file "condor_starter/starter.c" as follows:
The line
    limit( RLIMIT_STACK, RLIM_INFINITY );
becomes
    limit( RLIMIT_STACK, 8 * 1024 * 1024 );
best regards,
-- mike
=============================================================
Subject: Condor for HP-UX 9 Available
Date: Mon, 15 Nov 93 10:19:18 -0600
From: condor
Hello,
A port of CONDOR to the HP PA-RISC machines running HPUX 9.01
is now available, and replaces the previous alpha (broken) version.
You can grab a source code tar file via anon ftp from:
ftp.cs.wisc.edu:/condor/Condor_4.1.hp700source.beta1.tar.Z
We hope to soon have another file available which contains
just the HP PA-RISC binaries, to save compiling hassles for folks.
Once it's available, we'll send another message.
Below is a copy of the README.1ST file from the
HP PA-RISC condor file, which details what has been fixed.
I would like to thank everyone who helped out with getting
Condor running on the PA-RISCs. The steady supply of
bug reports and suggested fixes was very helpful. I would
like to especially thank Bret McKee at HP, who helped us
with PA-RISC address queue registers and thus got checkpointing
working properly.
-Todd Tannenbaum, just a guy dealing with condor on HP PA-RISCs
Director of Model Advanced Facility [MAF]
UW-Madison Computer Aided Engineering Center
Here is the README file.
README.1ST:
November 11, 1993
What you have here is the beta.1 source code to Condor 4.1 for the
HP 9000/700 series of workstations running HPUX 9.01.
What is different about this release -vs- the earlier alpha release
of Condor for the HP 700?
- Checkpointing now works properly with no segmentation faults :-)
- This version is written for and tested on HPUX 9.01 only. It has
never been tested on HPUX 8.07, although I _think_ it will work
after a few minor changes to get it to compile (and a few changes to
some of the Imakefiles). Again, don't let the #defines and HPUX8
references everywhere fool you... this release is for HPUX9.01.
- Lots of nonobvious bugs and strange behaviors under HPUX found & fixed.
- Remote time usage reporting should now work correctly
- The MEMORY config file parameter is no longer needed. Condor will now
figure out the amount of RAM installed by reading it out of the kernel.
The MEMORY parameter is only used as a fallback in case of an error.
What still needs to be done to the HP 700 port of Condor?
- Submitting jobs from a platform other than an HP does not work yet.
You must submit your job from an HP 700 in order for it to run properly
on a condor pool of HP 700s. As it turns out, very few Condor sites
care about cross-platform job submitting. I'll hopefully have this
working soon.
- None of the documentation/man pages have been updated yet.
- Although there are #defines everywhere for other platforms, this source
tree will only compile on HPUX. We are beginning to work on folding
the HP source code back into the main Condor platform-independent source.
- The amount of free swap space is still calculated incorrectly. This
is used for optimization, and is not _required_, per se. Expect it to be
fixed in the next release.
Is there an easier way to figure out how to compile my job for condor?
Here are a few ideas:
(1) try out the condor_compile command, located in the condor_compile
subdirectory. Read the README located there. After installing
condor_compile, you can compile for condor by typing:
condor_compile <whatever you normally type to create an executable>
For instance, if you normally compile by typing "f77 +e +O3 myprog.f",
typing "condor_compile f77 +e +O3 myprog.f" will result in a
condor executable called a.out.condor. (condor_compile will append
a ".condor" to whatever your executable name normally comes out as).
You can type "condor_compile make myprogram", or "condor_compile
cc .....", whatever. condor_compile currently works on HPUX and
on SunOS 4.1.x. The whole idea behind condor_compile is a rather
ugly hack, but condor end-users who just want to use Condor without
a lecture on linking love it.
(2) HPUX9.01 supports the "-v" option on most compilers, which displays
all the options being passed to ld. Link for condor with the
exact same options you see when you compile with "-v", but replace
crt0 and -lc with the condor versions.
(3) Examine the Makefile in the test suite directories.
Enjoy! We currently have about 80 HP 700s in our pool, and it blows the
doors off of our old Sun SPARC 1 pool. Happy crunching.
Todd Tannenbaum
Director of the Model Advanced Facility (MAF)
University of Wisconsin-Madison Computer Aided Engineering Center
Questions/comments/problems/bugs with CONDOR in general?
send internet email to: condor@cs.wisc.edu
Questions/comments/problems/bugs *specific to the HP 700 port* of CONDOR?
send internet email to: tannenba@engr.wisc.edu
=============================================================
Subject: Condor Job Termination Reports
Date: Thu, 03 Mar 94 11:21:24 -0600
From: condor
A number of folks have asked questions regarding the meanings of the
various job termination messages generated by Condor. Often folks have
thought that the exit status and termination signal numbers are
generated by Condor, and have asked "where in the Condor documentation
are these listed?". In fact Condor is only reporting to you the
information about how your job terminated which is made available by
the underlying operating system (Unix).
Following are a few tips on understanding process termination which
may be helpful to those of you not already intimate with these
details.
best regards,
-- mike
To understand this information, you first need to know that every Unix
process will terminate in one of two ways - "normally", or
"abnormally".
Normal termination
A process is said to terminate "normally" when it calls the exit()
function, or when the function main() returns. In either case it
is possible for the application programmer to provide a number
called the "exit status". If your program terminates by calling
exit(), then the status is an integer argument to that function.
If your program terminates by reaching the end of main(), then the
status is the return value of that function. For example:
exit( 0 );
or
return 0;
One aspect which is sometimes confusing is what happens if your
application fails to provide a status value at exit time by calling
exit() with no arguments, reaching the end of main() with a return
statement with no value, or reaching the end of main() with no
return statement. In such a case, the exit status is "undefined" -
in other words some value will be reported, but it is meaningless.
Another aspect which is sometimes confusing is the size of the exit
status. In general only 8 bits are allowed for this purpose. On
most platforms you can then think of the exit status as an unsigned
char, i.e. it can only hold values [0 - 255]. A common mistake is
calling "exit( -1 )" in case of an error. The exit status in this
case will be reported as 255!
When your program exits normally the message from Condor will
look something like
Your Condor job
a.out arg_1 arg_2 arg_3
exited with status 73.
In such a case, your system administrator cannot answer the
question "what does exit status 73 mean?". That is the exit status
returned by your code, which as discussed above, may or may not be
meaningful.
There are a couple of special cases to consider. First you should
realize that your program contains a mixture of your own code,
Condor supplied code, and other library code. If some unexpected
event causes your program to exit while executing the Condor
supplied code, the exit status is generally 4. Also some versions
of Condor take the exit status 255 (remember -1), to have special
meaning. We recommend that your code always provide a meaningful
exit status, and that the values 4 and 255 not be used for this
purpose. (It is traditional to return a status of zero when a
program terminates correctly, and non-zero when it exits with an
error.)
Finally, you may wonder why you don't see these "exit status"
numbers when you run your job outside of Condor. In fact they do
exist, and can be found in the shell variable $status (in csh;
Bourne-style shells report the same value in $?), which is
updated after every shell command. For example
will run the "ls" command, and then print its exit status.
Abnormal Termination
A program is said to have terminated "abnormally" if it is killed
by being sent one of a set of signals which causes termination, and
that signal is not being blocked, caught, or ignored. In many
cases these "terminating" signals will also cause a core file to be
generated. Such an untimely death can happen to both Condor and
non-Condor processes, and is generally reported by the shell - for
example the ever popular
Bus error (core dumped)
Note that the core dump will not happen if you have set a
"coredumpsize" limit too small to allow it, or if your file system
doesn't have enough space.
When your Condor process is killed by such a signal the message
will look something like
Your Condor job
a.out arg_1 arg_2 arg_3
was killed by signal 10.
There may also be a clause telling you that you have a core dump,
and the name of the core file. The core file will not appear if
there was insufficient file system space on either the executing
machine or your machine, or you have asked not to get core files
(see condor_submit(1) for details). In this case Condor reports
the signal as "number 10", but does not translate the number to the
string "Bus error". This is because such translation is generally
not portable across various Unix implementations. The meanings of
the signals are defined in the header file <signal.h>.
Note that the core file may be useful in determining what caused
the untimely death of your process, and in particular whether it
was executing your code or Condor code at the time of the event.
It would thus be bad form to remove the core right before asking
your Condor system administrator for help in determining what went
wrong.